If your prior is uniform, it applies the same multiplier to each possible value of the parameter \(\theta\). So…
\[
\text{Posterior} \propto \text{Likelihood}
\]
Uniform Priors
Pros:
“Objective” 🙄
lets the “data speak for themselves”
easy to explain
sometimes gives results roughly equivalent to Frequentist estimates
Cons:
you often know something (non-regularizing)
non-invariance under re-parameterization
improper over \(\mathbb{R}^n\)
Uniform Priors (Freq)
library(brms)

# freq
lm(y ~ x, data = df) |> summary()
Call:
lm(formula = y ~ x, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.11461 -0.74886 0.00784 0.63176 2.12521
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11802 0.09295 1.270 0.207
x -0.47841 0.08475 -5.645 1.61e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9294 on 98 degrees of freedom
Multiple R-squared: 0.2454, Adjusted R-squared: 0.2377
F-statistic: 31.86 on 1 and 98 DF, p-value: 1.61e-07
Uniform Priors (Bayes)
# bayes
priors <- c(prior("", class = "b"),
            prior("", class = "Intercept"),
            prior("uniform(0, 1e6)", class = "sigma"))
brm(y ~ x, data = df, prior = priors, verbose = 2) |> summary()
Family: gaussian
Links: mu = identity; sigma = identity
Formula: y ~ x
Data: df (Number of observations: 100)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.12 0.09 -0.07 0.31 1.00 3583 2754
x -0.48 0.09 -0.64 -0.31 1.00 3460 2741
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.94 0.07 0.82 1.09 1.00 4224 3164
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Now, let’s look at what Parameterization 2 (uniform prior on \(e^\beta\)) implies about \(\beta\):
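A few lines of base R can sketch this (a hypothetical illustration, not code from the slides; the upper bound of 100 is arbitrary): draw from a uniform prior on \(e^\beta\) and transform the draws back to the \(\beta\) scale.

```r
# Hypothetical sketch: a uniform prior on exp(beta) over (0, 100),
# transformed back to the beta scale.
set.seed(540)
u <- runif(1e5, min = 0, max = 100)  # uniform prior on e^beta
beta <- log(u)                       # implied prior on beta
# The implied density p(beta) is proportional to e^beta, so the prior
# mass piles up near log(100) -- it is anything but "flat":
mean(beta > log(50))  # ~0.5: half the mass lies in (log 50, log 100)
```

The implied density on \(\beta\) is proportional to \(e^\beta\): half of the prior mass on \(\beta\) is squeezed into the narrow interval \((\log 50, \log 100)\), so a prior that is flat for \(e^\beta\) is strongly informative about \(\beta\).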
Jeffreys Prior
\[
p(\theta) \propto \sqrt{I(\theta)}
\]
Jeffreys Prior: a method of constructing non-informative priors that is invariant under re-parameterization
(see Bayesian Data Analysis 2.8 for more math)
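A standard worked example (sketched here; see BDA for the full derivation): for a single Bernoulli(\(\theta\)) observation, \(I(\theta) = 1/(\theta(1-\theta))\), so

\[
p(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2} (1-\theta)^{-1/2},
\]

which is the kernel of a Beta(1/2, 1/2) distribution. Applying the same construction after re-parameterizing (say, to the log-odds scale) yields exactly the transformed version of this prior, which is the invariance property.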
Fisher Information
Slightly Mathy Idea: Fisher Information measures how sensitive the log-likelihood function \(\ell(\theta \mid X)\) is to changes in \(\theta\) (more sensitive \(\to\) more information):
\[
I(\theta) = -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \, \ell(\theta \mid X) \right]
\]
where \(\ell(\theta \mid X)\) is the log-likelihood of \(\theta\) given \(X\). If \(\ell\) is sensitive to changes in \(\theta\), the second derivative will be large in magnitude and we expect to see high information.
Note: usually \(\ell\) is concave down around the maximum likelihood estimate, meaning that the second derivative will be negative, hence the negative sign in front of the expectation.
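As a quick numerical check (my own sketch, not from the slides): for a single Bernoulli(\(\theta\)) observation the Fisher information is \(I(\theta) = 1/(\theta(1-\theta))\), and we can recover it by differentiating the log-likelihood numerically and negating the expectation.

```r
# Numerical check of I(theta) = -E[ d^2/dtheta^2 log L ] for one
# Bernoulli(theta) observation; analytically I(theta) = 1/(theta*(1-theta)).
loglik <- function(theta, x) x * log(theta) + (1 - x) * log(1 - theta)
theta <- 0.3
h <- 1e-4
# second derivative of the log-likelihood via central differences:
d2 <- function(x) {
  (loglik(theta + h, x) - 2 * loglik(theta, x) + loglik(theta - h, x)) / h^2
}
# expectation over X ~ Bernoulli(theta), with the leading minus sign:
info <- -(theta * d2(1) + (1 - theta) * d2(0))
info  # approx 1 / (0.3 * 0.7) = 4.762
```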
Uniform Priors (Regularization)
Regularization (in general): discourages overfitting, makes models simpler
Regularizing Priors: keep parameter values in a “reasonable range”
Uniform Priors (Regularization)
Ridge = Normal Prior, Lasso = Laplace Prior.
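The correspondence runs through the log-density: up to constants, the negative log of a Normal prior is the ridge (L2) penalty and the negative log of a Laplace prior is the lasso (L1) penalty. A small base-R sketch (illustrative, not from the slides):

```r
# Negative log prior densities, up to additive constants:
beta <- c(-2, -1, 0, 1, 2)
# Normal(0, 1) prior: -log p(beta) = beta^2 / 2 + const  (ridge / L2)
ridge_pen <- -dnorm(beta, mean = 0, sd = 1, log = TRUE)
# Laplace(0, 1) prior: p(beta) = exp(-|beta|) / 2, so
# -log p(beta) = |beta| + log(2)                          (lasso / L1)
lasso_pen <- abs(beta) + log(2)
round(ridge_pen - ridge_pen[beta == 0], 2)  # 2.0 0.5 0.0 0.5 2.0
round(lasso_pen - lasso_pen[beta == 0], 2)  # 2 1 0 1 2
```

MAP estimation under these priors therefore matches the ridge/lasso point estimates, though the full posterior carries more information than a point estimate does.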
Other Types of Priors
Uninformative: all possible values for \(\theta\) have the same relative prior likelihood
Weakly Informative: regularize our estimates, extreme values are less likely, but not much other info
Strongly Informative: prior is narrow around known likely values
When would you use each type of prior?
Choosing Prior Distributions
think about the range of possible values of \(\theta\)
think about amount of prior uncertainty
think about regularization
In reality, we often report 2-3 different versions of the models with various priors to see how these priors change the inferences we make.
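That workflow might look like the following (a hypothetical sketch, assuming a data frame df with columns y and x as in the surrounding examples; each brm call recompiles and samples, so this is slow):

```r
library(brms)

# Hypothetical sensitivity check: refit the same model under a few
# candidate priors on the slope and compare posterior summaries.
prior_sets <- list(
  flat   = c(prior("", class = "b")),               # brms default (flat)
  weak   = c(prior("normal(0, 5)", class = "b")),
  strong = c(prior("normal(0, 0.5)", class = "b"))
)
fits <- lapply(prior_sets, function(p)
  brm(y ~ x, data = df, prior = p, refresh = 0)
)
# Compare the fixed-effect estimates across prior choices:
lapply(fits, fixef)
```

If the inferences barely move across these fits, the data dominate the prior; if they move a lot, the prior choice deserves explicit discussion.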
Setting Priors in brms
set.seed(540)
n <- 100
x <- rnorm(n)
y <- 0.25 - x * 0.5 + rnorm(n)
df <- data.frame(x, y)

# bayes
priors <- c(prior("normal(0,5)", class = "b"),
            prior("normal(0,4)", class = "Intercept"),
            prior("gamma(0.1, 10)", class = "sigma"))
brm(y ~ x, data = df, prior = priors, verbose = 2) |> summary()
Family: gaussian
Links: mu = identity; sigma = identity
Formula: y ~ x
Data: df (Number of observations: 100)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.25 0.09 0.07 0.44 1.00 3900 3130
x -0.49 0.10 -0.70 -0.29 1.00 3793 2966
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.98 0.07 0.86 1.12 1.00 3827 2876
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).